Graph-Parallel Entity Resolution using LSH & IMM
Authors
Abstract
In this paper we describe graph-based parallel algorithms for entity resolution that improve over the map-reduce approach. We compare two approaches to parallelizing a Locality Sensitive Hashing (LSH) accelerated, Iterative Match-Merge (IMM) entity resolution technique: BCP, where records hashed together are compared at a single node/reducer, and an alternative mechanism, RCP, where the comparison load is better distributed across processors, especially in the presence of severely skewed bucket sizes. We study the BCP and RCP approaches analytically as well as empirically, using a large synthetically generated dataset. We generalize the lessons learned from our experience and submit that the RCP approach is also applicable to many similar applications that rely on LSH or related grouping strategies to minimize pair-wise comparisons.
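To make the effect of skewed bucket sizes concrete, the following is a minimal sketch of the load imbalance the abstract alludes to. It is not the paper's BCP or RCP algorithm: the function names, the modulo assignment of buckets to workers, and the round-robin assignment of candidate pairs are all assumptions chosen for illustration. It only counts pairwise comparisons per worker when whole buckets are compared in one place versus when candidate pairs are spread across workers.

```python
from itertools import combinations


def bucket_level_load(buckets, num_workers):
    """Comparisons per worker when every pair in a bucket is compared on
    a single worker (hypothetical rule: bucket index modulo workers)."""
    load = [0] * num_workers
    for i, records in enumerate(buckets):
        n = len(records)
        load[i % num_workers] += n * (n - 1) // 2  # all pairs in the bucket
    return load


def pair_level_load(buckets, num_workers):
    """Comparisons per worker when candidate pairs are dealt round-robin
    to workers, regardless of which bucket produced them (an assumption)."""
    load = [0] * num_workers
    k = 0
    for records in buckets:
        for _pair in combinations(records, 2):
            load[k % num_workers] += 1
            k += 1
    return load


# Severely skewed buckets: one bucket of 100 records, nine buckets of 5.
buckets = [list(range(100))] + [list(range(5)) for _ in range(9)]

print(bucket_level_load(buckets, 4))  # [4970, 30, 20, 20]
print(pair_level_load(buckets, 4))    # [1260, 1260, 1260, 1260]
```

In this toy setting the total work is the same (5040 comparisons), but under bucket-level assignment one worker carries almost all of it, whereas distributing pairs evens out the load. A real system would also have to account for the cost of shipping records to the workers that compare them, which this sketch ignores.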
Similar resources
Top-K Entity Resolution with Adaptive Locality-Sensitive Hashing
Given a set of records, entity resolution algorithms find all the records referring to each entity. In this paper, we study the problem of top-k entity resolution: finding all the records referring to the k largest (in terms of records) entities. Top-k entity resolution is driven by many modern applications that operate over just the few most popular entities in a dataset. We ...
Parallel Privacy-Preserving Record Linkage using LSH-based blocking
Privacy-preserving record linkage (PPRL) aims at integrating person-related data without revealing sensitive information. For this purpose, PPRL schemes typically use encoded attribute values and a trusted party for conducting the linkage. To achieve high scalability of PPRL to large datasets with millions of records, we propose parallel PPRL (P3RL) approaches that build on current distributed ...
The Effect of Transitive Closure on the Calibration of Logistic Regression for Entity Resolution
This paper describes a series of experiments in using logistic regression machine learning as a method for entity resolution. From these experiments the authors concluded that when a supervised ML algorithm is trained to classify a pair of entity references as linked or not linked pair, the evaluation of the model’s performance should take into account the transitive closure of its pairwise lin...
The Graduate School HIGH PERFORMANCE RECORD LINKAGE
In the current world, the immense size of a data set creates problems in finding similar/identical data. In addition, the dirtiness of data, i.e. typos, missing/tilting information, and additional noise usually caused by careless editing or entry mistakes, makes it even more difficult to identify which entity a record belongs to. Therefore, we focus on the faster detection of data referring to the same real-world entity ...
Towards a Scalable and Robust Entity Resolution - Approximate Blocking with Semantic Constraints
Entity resolution, or record linkage, is the process that identifies data records over one or more datasets which refer to the same real world entity. To deal with large datasets, many real-life applications require scalable and high-quality entity resolution techniques. Blocking techniques can help to scale-up the entity resolution process. Locality sensitive hashing (LSH) is an approximate bl...